gh-148762: Speed up multiline regexes anchored by ^ #152339
Signed-off-by: Harmen Stoppels <harmenstoppels@gmail.com>
eendebakpt
left a comment
Claude suggests adding some tests for coverage. Not sure we need all of them, but including in details here for reference.
<details>
<summary>Details</summary>

```python
    def test_search_anchor_at_beginning_line(self):
        # gh-148762: a multiline "^" search jumps between line starts. These
        # cases pin the behaviour the optimization must preserve.
        for pattern, cases in [
            ('^', [
                ('', [(0, 0)]),
                ('abc', [(0, 0)]),
                ('\n', [(0, 0), (1, 1)]),
                ('\n\n', [(0, 0), (1, 1), (2, 2)]),
                ('a\n', [(0, 0), (2, 2)]),  # match at end after \n
                ('\na', [(0, 0), (1, 1)]),
                ('a\nb\nc', [(0, 0), (2, 2), (4, 4)]),
                ('a\n\nb', [(0, 0), (2, 2), (3, 3)]),  # empty line
                ('\n\n\n', [(0, 0), (1, 1), (2, 2), (3, 3)]),
            ]),
            ('^a', [
                ('a', [(0, 1)]),
                ('a\na', [(0, 1), (2, 3)]),
                ('a\nba\na', [(0, 1), (5, 6)]),
                ('ba\nab', [(3, 4)]),
                ('a\n', [(0, 1)]),  # no match-at-end: needs 'a'
                ('\na', [(1, 2)]),
                ('aa\naa', [(0, 1), (3, 4)]),
                ('a\n\na', [(0, 1), (3, 4)]),
                ('a\nĀa\na', [(0, 1), (5, 6)]),  # UCS2 string kind
                ('Ā\na\nĀ', [(2, 3)]),
                ('a\n\U0001F600a\na', [(0, 1), (5, 6)]),  # UCS4 string kind
                ('\U0001F600\na', [(2, 3)]),
            ]),
        ]:
            p = re.compile(pattern, re.MULTILINE)
            for s, expected in cases:
                with self.subTest(pattern=pattern, string=s):
                    self.assertEqual([m.span() for m in p.finditer(s)], expected)
        # bytes (8-bit) path
        pb = re.compile(b'^a', re.MULTILINE)
        for s, expected in [(b'a\nba\na', [(0, 1), (5, 6)]), (b'a\n', [(0, 1)]),
                            (b'\na', [(1, 2)]), (b'abc', [(0, 1)])]:
            with self.subTest(string=s):
                self.assertEqual([m.span() for m in pb.finditer(s)], expected)
        # pos / endpos: the search may begin mid-line or on a line start
        pa = re.compile('^a', re.MULTILINE)
        self.assertEqual([m.span() for m in pa.finditer('xa\na', 1)], [(3, 4)])
        self.assertEqual([m.span() for m in pa.finditer('a\na', 2)], [(2, 3)])
        self.assertEqual([m.span() for m in pa.finditer('a\na\na', 1, 3)], [(2, 3)])
        self.assertEqual([m.span() for m in pa.finditer('a\na', 0, 1)], [(0, 1)])
        # sub / subn / split also drive search()
        pc = re.compile('^', re.MULTILINE)
        self.assertEqual(pc.sub('#', 'a\nb\nc'), '#a\n#b\n#c')
        self.assertEqual(pc.sub('#', 'a\nb\n'), '#a\n#b\n#')
        self.assertEqual(pc.subn('#', 'a\nb\n'), ('#a\n#b\n#', 3))
        self.assertEqual(pc.split('a\nb'), ['', 'a\n', 'b'])
        self.assertEqual(pc.split('a\nb\n'), ['', 'a\n', 'b\n', ''])
```

</details>
```c
    while (ptr < end && !SRE_IS_LINEBREAK(*ptr))
        ptr++;
    if (ptr >= end)
        return 0;
```
This could be
```diff
-    while (ptr < end && !SRE_IS_LINEBREAK(*ptr))
-        ptr++;
-    if (ptr >= end)
-        return 0;
+#if SIZEOF_SRE_CHAR == 1
+    ptr = memchr(ptr, '\n', end - ptr);
+    if (ptr == NULL)
+        return 0;
+#else
+    while (ptr < end && !SRE_IS_LINEBREAK(*ptr))
+        ptr++;
+    if (ptr >= end)
+        return 0;
+#endif
```
(I did not benchmark, not sure it is worth the change)
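As a rough illustration of the two strategies in this suggestion, here is a Python sketch (not the C code in the PR). It puts a hand-written scan for the next line break next to a library search, with `bytes.find` standing in for `memchr`. The function names are hypothetical. Both must return the same index; which one is faster depends on how densely `\n` occurs in the input, which this sketch does not measure.

```python
def next_linebreak_loop(buf: bytes, start: int) -> int:
    """Hand-written scan: index of the next b'\\n' at or after start, or -1."""
    i = start
    while i < len(buf) and buf[i] != 0x0A:  # 0x0A is the byte value of '\n'
        i += 1
    return i if i < len(buf) else -1


def next_linebreak_find(buf: bytes, start: int) -> int:
    """Library search (stand-in for memchr): same result via bytes.find."""
    return buf.find(b"\n", start)
```

For 8-bit data the two are interchangeable in terms of results, so the choice between them is purely a performance question.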
I had another issue/PR where I tried something like this, but it's hard to optimize when you don't know the distribution of the \n character in the "haystack". See #148729 (comment); on my MacBook it was hard to beat a hand-written loop:
The regression on darwin is because the letter i has a density of 2.88% in the corpus; the cross-over density is apparently about 2%, below which memchr is faster.
Based on wc -cl $(find -type f -name '*.py') run on CPython's own sources, there are 988799 lines and 36153429 bytes, i.e. a newline density of about 2.7%. That is above the roughly 2% cross-over, so on my MacBook memchr would likely be slightly slower. But it depends on the use case.
So, I would hold off on combining different types of optimizations. This PR is about reducing the number of expensive match function calls.
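The density estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the two counts are the figures quoted in the comment, the 2% cross-over is the approximate value reported for that one machine, and the constant and function names are made up for illustration.

```python
# Figures quoted in the comment above (wc -cl over CPython's *.py files).
# wc -l counts newline characters, so LINES is the number of '\n' bytes.
LINES = 988_799
TOTAL_BYTES = 36_153_429
# Approximate cross-over density reported in the linked comment (assumption:
# it applies to this machine and workload only).
CROSSOVER_DENSITY = 0.02


def newline_density(lines: int, total_bytes: int) -> float:
    """Fraction of bytes that are newline characters."""
    return lines / total_bytes


density = newline_density(LINES, TOTAL_BYTES)
print(f"newline density: {density:.1%}")
print(f"mean bytes per line: {TOTAL_BYTES / LINES:.1f}")
print("above cross-over:", density > CROSSOVER_DENSITY)
```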
Multiline regexes of the form re.compile("^foo", re.MULTILINE) currently
fall into the generic search loop, which calls SRE(match) at every
position in the subject string. Since a ^-anchored (SRE_AT_BEGINNING_LINE)
pattern can only match at the start of the string or right after a linebreak,
we can instead jump from one line start to the next, skipping all the
intermediate positions.
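The idea can be modelled in pure Python. This is a sketch of the technique, not the C implementation in the PR: `line_starts` and `finditer_spans_at_line_starts` are hypothetical helpers. The second one takes the pattern without its leading `^` and tries it only at line starts, which should give the same spans as `finditer` on the anchored multiline pattern.

```python
import re


def line_starts(s, pos=0, endpos=None):
    """Yield each index in [pos, endpos] where a multiline ^ can match.

    Assumes 0 <= pos <= endpos <= len(s).
    """
    if endpos is None:
        endpos = len(s)
    # The search start qualifies only if it is the string start or follows '\n'.
    if pos == 0 or s[pos - 1] == "\n":
        yield pos
    i = s.find("\n", pos, endpos)
    while i != -1:
        yield i + 1  # the position right after a linebreak
        i = s.find("\n", i + 1, endpos)


def finditer_spans_at_line_starts(body, s):
    """Spans of '^' + body under re.MULTILINE, trying body only at line starts."""
    spans = []
    prev_end = 0
    for start in line_starts(s):
        if start < prev_end:
            continue  # inside the previous match; matches do not overlap
        m = body.match(s, start)
        if m is not None:
            spans.append(m.span())
            prev_end = m.end()
    return spans
```

For example, `finditer_spans_at_line_starts(re.compile('a'), 'a\nba\na')` calls `match` three times (at indices 0, 2 and 5) instead of once per position in the string.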
Benchmarks show good improvements in runtime across UCS-1/2/4; full
numbers are in the issue.
^. #148762